104 research outputs found
Impact of Biases in Big Data
The underlying paradigm of big data-driven machine learning reflects the
desire of deriving better conclusions from simply analyzing more data, without
the necessity of looking at theory and models. Is having simply more data
always helpful? In 1936, The Literary Digest collected 2.3M filled-in
questionnaires to predict the outcome of that year's US presidential election.
The outcome of this big data prediction proved to be entirely wrong, whereas
George Gallup only needed 3K handpicked people to make an accurate prediction.
Generally, biases occur in machine learning whenever the distributions of
training set and test set are different. In this work, we provide a review of
different sorts of biases in (big) data sets in machine learning. We provide
definitions and discussions of the most commonly appearing biases in machine
learning: class imbalance and covariate shift. We also show how these biases
can be quantified and corrected. This work is an introductory text for both
researchers and practitioners to become more aware of this topic and thus to
derive more reliable models for their learning problems.
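The abstract above names class imbalance as one bias that can be quantified and corrected, without saying how. As a minimal sketch, and not the authors' method, the following assumes the common inverse-frequency reweighting scheme, in which each class gets the weight n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the toy labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return one weight per class: n_samples / (n_classes * class_count).

    Rare classes receive large weights, so that every class contributes
    the same total weight to a weighted training loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical toy data: 90 regular customers, 10 irregular ones.
labels = ["regular"] * 90 + ["irregular"] * 10
weights = balanced_class_weights(labels)

# Imbalance ratio (majority count / minority count) as a simple measure.
counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
```

With these toy labels the minority class is weighted ten times as heavily as the majority class, which offsets the 9:1 imbalance in the summed weights.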
On the Reduction of Biases in Big Data Sets for the Detection of Irregular Power Usage
In machine learning, a bias occurs whenever training sets are not
representative for the test data, which results in unreliable models. The most
common biases in data are arguably class imbalance and covariate shift. In this
work, we aim to shed light on this topic in order to increase the overall
attention to this issue in the field of machine learning. We propose a scalable
novel framework for reducing multiple biases in high-dimensional data sets in
order to train more reliable predictors. We apply our methodology to the
detection of irregular power usage from real, noisy industrial data. In
emerging markets, irregular power usage, and electricity theft in particular,
may range up to 40% of the total electricity distributed. Biased data sets are
a particular issue in this domain. We show that reducing these biases
increases the accuracy of the trained predictors. Our models have the potential
to generate significant economic value in a real-world application, as they are
being deployed in commercial software for the detection of irregular power
usage.
The Challenge of Non-Technical Loss Detection using Artificial Intelligence: A Survey
Detection of non-technical losses (NTL), which include electricity theft,
faulty meters or billing errors, has attracted increasing attention from
researchers in electrical engineering and computer science. NTLs cause
significant harm to the economy, as in some countries they may range up to 40%
of the total electricity distributed. The predominant research direction is
employing artificial intelligence to predict whether a customer causes NTL.
This paper first provides an overview of how NTLs are defined and their impact
on economies, which includes loss of revenue and profit of electricity providers
and decrease of the stability and reliability of electrical power grids. It
then surveys the state-of-the-art research efforts in an up-to-date and
comprehensive review of algorithms, features and data sets used. It finally
identifies the key scientific and engineering challenges in NTL detection and
suggests how they could be addressed in the future.
Classification of concepts through products of concepts and abstract data types (abstract)
The classification scheme formalism, which represents both usual data types and structured objects in a uniform manner, is introduced. It is here provided with a dissimilarity measure which only takes into account the structure of a given domain: a partial order over a set of classes. The measure we define compares a pair of individuals according to their mutual position within the taxonomy structuring the underlying domain. It is then used to design a classification algorithm that works on structured objects.
Une stratégie de construction de taxonomies dans les objets
Automatically building a taxonomy of classes from co-defined and indistinguishable objects is not an easy task. Partitioning the set of objects into domains and ordering these domains hierarchically by the composition relation makes it possible to differentiate the objects and to avoid certain cycles involving a composition relation. Furthermore, using a dissimilarity built on the class taxonomies that already exist in some domains avoids having to handle other cycles. Circular references nevertheless remain, and these are then confined to a well-identified part of the domains.
An integrative proximity measure for ontology alignment
Integrating heterogeneous resources of the web will require finding agreement between the underlying ontologies. A variety of methods from the literature may be used for this task; basically, they perform pair-wise comparison of entities from each of the ontologies and select the most similar pairs. We introduce a similarity measure that takes advantage of most of the features of OWL-Lite ontologies and integrates many ontology comparison techniques in a common framework. Moreover, we put forth a computation technique to deal with one-to-many relations and circularities in the similarity definitions.
Using FCA to Suggest Refactorings to Correct Design Defects
Design defects are poor design choices resulting in hard-to-maintain software, hence their detection and correction are key steps of a
disciplined software process aimed at yielding high-quality software
artifacts. While modern structure- and metric-based techniques enable
precise detection of design defects, the correction of the discovered
defects, e.g., by means of refactorings, remains a manual, hence
error-prone, activity. As many of the refactorings amount to re-distributing
class members over a (possibly extended) set of classes, formal concept
analysis (FCA) has been successfully applied in the past as a formal
framework for refactoring exploration. Here we propose a novel approach
for defect removal in object-oriented programs that combines the
effectiveness of metrics with the theoretical strength of FCA. A
case study of a specific defect, the Blob, drawn from the
Azureus project illustrates our approach.
Is Big Data Sufficient for a Reliable Detection of Non-Technical Losses?
Non-technical losses (NTL) occur during the distribution of electricity in
power grids and include, but are not limited to, electricity theft and faulty
meters. In emerging countries, they may range up to 40% of the total
electricity distributed. In order to detect NTLs, machine learning methods are
used that learn irregular consumption patterns from customer data and
inspection results. The Big Data paradigm followed in modern machine learning
reflects the desire of deriving better conclusions from simply analyzing more
data, without the necessity of looking at theory and models. However, the
sample of inspected customers may be biased, i.e. it does not represent the
population of all customers. As a consequence, machine learning models trained
on these inspection results are biased as well and therefore lead to unreliable
predictions of whether customers cause NTL or not. In machine learning, this
issue is called covariate shift and has not been addressed in the literature on
NTL detection yet. In this work, we present a novel framework for quantifying
and visualizing covariate shift. We apply it to a commercial data set from
Brazil that consists of 3.6M customers and 820K inspection results. We show
that some features have a stronger covariate shift than others, making
predictions less reliable. In particular, previous inspections were focused on
certain neighborhoods or customer classes and were not sufficiently
spread among the population of customers. This framework is about to be
deployed in a commercial product for NTL detection.
Comment: Proceedings of the 19th International Conference on Intelligent
System Applications to Power Systems (ISAP 2017)
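The abstract above speaks of quantifying covariate shift per feature but does not describe the framework itself. As a rough illustration only, and not the authors' framework, the sketch below assumes a single categorical feature, measures the shift between the inspected sample and the full population with the total variation distance, and derives importance weights p_population / p_sample that could reweight the inspected customers. All names and the toy neighborhood counts are hypothetical.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a categorical feature."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 to 1)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def importance_weights(population, sample):
    """Weight p_population / p_sample for each category seen in the sample."""
    return {k: population.get(k, 0.0) / p for k, p in sample.items()}


# Hypothetical toy data: neighborhood of all customers vs. inspected customers.
all_customers = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
inspected = ["A"] * 80 + ["B"] * 15 + ["C"] * 5

p_pop = distribution(all_customers)
p_insp = distribution(inspected)

shift = total_variation(p_pop, p_insp)
weights = importance_weights(p_pop, p_insp)
```

A distance near 0 means the inspected sample mirrors the population on that feature, while larger values indicate that inspections concentrated on some categories. In the toy data, neighborhood A is over-inspected and therefore gets a weight below 1.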
- …